Abstract

Context-free grammars provide the basis for many useful tools such as parser-generators,
compiler-compilers and syntax-directed editors. This paper demonstrates
the potential benefits obtained when context-free grammars are used to define complex
objects in the relational model. The grammar formalism facilitates relational queries
on the hierarchical structure of these objects and promotes the use of grammar-based
tools as front ends to relational database systems.

Keywords:  Advanced  Applications,  Data  Models,  Query  Languages,  Complex  Objects 


1  Introduction 

Several research projects ([BK85,CNR90,Lin84,LKM+85,RY85,Row89], among others) provide
relational database support for complex objects such as engineering part descriptions,
text, and software programs. The schema for these projects is defined using the relational
schema definition language and, due to the nature of complex objects, the data of a given
object is often distributed over several relations. These relational schema definitions do
not indicate how the relational data is derived from the original objects or how it can be
combined to reconstruct the original objects. This information must be reflected in data
extraction tools and queries that are written by the user. Furthermore, the resulting relations
only store aspects of the objects that are of interest to the current application.
This poses a problem for query evolution: as the application evolves and new queries arise,
additional information about the objects (which was most likely part of the original
specification) must be gathered and stored in the database. The data extraction tools must be
constantly rewritten to reflect these changes.

* This research was sponsored partially by the National Science Foundation under Grant IRI-8719458, by
the Air Force Office for Scientific Research under Grant AFOSR-89-0303, and by the University of Maryland
Institute of Advanced Computer Studies.
Grammars have been described as a useful representation for data structures [GT83]
and hierarchical structures in information [GT87]. A formal description of a grammatical
database model is given in [GPV89]. However, the authors do not discuss how this model is
realized in a relational database system.

This paper proposes one translation of grammatically defined complex objects into
relational schema and demonstrates how the grammar definitions form a basis for tools that
support the population and manipulation of the database. Section 2 formally describes a
grammatical schema definition language and gives an algorithm, GeneRel, that translates
the grammatical schema to the relational schema under which these objects are stored. Section 3
describes a tool-generator, GeneParse, that generates parsers that populate the database,
illustrating the potential for building tool-generators whose products support database
operations for these objects. Support for a grammar catalog, described in Section 4, is obtained
by applying GeneParse to a grammar that defines the structure of the input grammars (the
meta-grammar). GeneRel and GeneParse are implemented as semantic actions of the
meta-grammar, and the implementation of the grammar catalog is generated by applying these
tools to the meta-grammar itself. Section 5 discusses the issue of querying these objects,
introducing three graph operators that facilitate the manipulation of hierarchical structures.
The computation of each operator is contained in the fixed point of a set of queries that
are derivable from the grammar specifications. Finally, Section 6 discusses issues of query
evolution that are related to efficiency.

2  Grammatical  Schema  Definition  Language 

The  grammatical  schema  definition  language  we  derive,  tagged  context-free  grammars 
(TCFGs),  is  a  variant  of  context-free  grammars  that  incorporates  the  concept  of  tokens, 
adapts  the  closure  notations  from  regular  expressions,  and  includes  tag-names  for  uniquely 
identifying  symbols  in  the  grammar. 

Tokens facilitate the grouping of terminal characters into single entities [HU79]. TCFGs
have two classes of tokens: delimiters (tokens whose domain consists of a single value) and
lexicons (tokens whose domain consists of more than one value). Furthermore, lexicons and
nonterminals are referred to collectively as nondelimiter symbols. A lexical analyzer that
returns tokens and their values must accompany each grammar.

Closure notation, used in regular expressions, is convenient for representing repeating
structures. Kleene closure applied to a symbol (i.e., x*) represents all strings that are a
concatenation of zero or more occurrences of strings derivable from the symbol (i.e., ε, x, xx, ..., x^n);
positive closure applied to a symbol (i.e., x+) represents the same set of strings as Kleene
closure minus the null string ε. We incorporate this notation into TCFGs because it encourages
the utilization of the set retrieval aspects of the relational query languages.

Tag-names, inspired by [MN88], allow the user to specify meaningful names for the
generated relations and attributes. All occurrences of nondelimiter symbols (i.e., nonterminals
and lexicons) are tagged.¹

Definition 1 A tagged context-free grammar (TCFG) is a 7-tuple E = (S, V, L, D, R, A, P)
where V is a finite set of nonterminals; L is a finite set of lexicons; D is a finite set of
delimiters; R is a set of production tag-names; A is a set of non-delimiter tag-names; V,
L, and D are disjoint; S ∈ V is the special start symbol; and P is a set of productions.
Productions are classified as either constructors or lists. A constructor production has the
form:

<r : N> → w_1 <a_1 : N_1> ... w_k <a_k : N_k> w_{k+1}

and a list production has the form:

<r : N> → <a_1 : N_1>□

where r is a production tag-name, N is a nonterminal, each w_j is a possibly empty string of
delimiters, each a_i is a non-delimiter tag-name, each N_i is a nonterminal or lexicon, and □ is either *
or +. ■


Example  1  Mechanical  engineers  wish  to  provide  database  support  for  information  about 
parts  in  a  manufacturing  resource  planning  (MRP)  system  [HY88].  The  MRP  system  keeps 
a  part  master  record  for  each  part  which  contains  the  part  number,  textual  specification, 
and  other  auxiliary  data  such  as  unit  of  measure  and  leadtime.  Parts  are  either  purchased 
or  manufactured.  Information  about  the  supplying  vendor  and  cost  is  recorded  for  each 
part  that  is  purchased.  A  bill  of  material  is  kept  for  each  manufactured  part,  which  contains 
the  quantity  of  each  subpart  that  is  required  to  manufacture  the  part.  The  following  TCFG 
captures  this  information: 


<pmr:part>          → <p#:int> <descr:str> <uom:int>
                      <leadtime:int> <partType:type>
<vendor:type>       → <vendorName:str> <cost:int>
<bom:type>          → <subpartQty:subpQty>+
<subparts:subpQty>  → <subpart:part> <qty:int>


This example contains four productions with production tag-names pmr, vendor, bom, and
subparts for the three nonterminals part, type, and subpQty. Notice that the production
tag-names vendor and bom differentiate between type information for purchased and
manufactured parts, respectively. There is a non-delimiter tag-name for each occurrence
of a nonterminal or lexicon on the right-side of the productions. There must be a lexical
analyzer that returns tokens for lexicons int and str and their corresponding values; it
would also be responsible for returning tokens for delimiters if there were any. This TCFG
is recursive, since a manufactured part's bill of material must contain at least one subpart
(indicated by positive closure) which is, in turn, a part. ■

¹ Note that this restriction need not be imposed on the user; we have built an automatic tagger that
generates TCFGs from YACC grammar specifications.

GeneRel  is  an  algorithm  that  translates  a  TCFG  into  relational  schema  definitions. 

Definition  2  A  relational  schema  definition  has  the  form: 

create r(a_1 : d_1 [not null], ..., a_n : d_n [not null])

where r is the relation name, the a_i are attribute names (which are unique within the
definition), the d_i define the domains of their corresponding attributes, and not null is
a string which is attached to each attribute that participates in the key of the relation. ■

Each nonterminal and lexicon symbol in the TCFG has a corresponding domain. The
domain N contains a surrogate² for each derivation of the nonterminal N stored in the
database. The domain L represents the syntactic category defined by the lexicon L.

One  relation  scheme  is  generated  for  each  production  in  the  TCFG.  The  form  of  the 
relation  schemes  generated  for  the  two  types  of  productions  is  similar  and  is  summarized 
in  Figure  1.  Each  generated  relation  inherits  its  name  from  the  production  tag-name  of  the 
corresponding  production.  It  has  one  attribute  for  each  non-delimiter  symbol  on  the  right- 
side  of  the  production;  the  attribute’s  name  is  the  same  as  the  non-delimiter’s  tag-name 
and  the  attribute’s  domain  is  a  domain  that  corresponds  to  the  non-delimiter  symbol.  Each 
relation  has  an  attribute  named  occur  which  is  defined  over  the  domain  that  corresponds  to 
the  left-side  nonterminal  of  the  production.  Relations  representing  list  productions  have  an 
additional  position  attribute  that  indicates  the  order  between  elements  in  the  same  list. 
The  key  for  relations  generated  from  constructor  productions  consists  of  the  single  attribute 
occur;  the  key  for  relations  generated  from  list  productions  consists  of  the  attribute  pair 
(occur,  position). 
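The translation rules above can be illustrated with a minimal Python sketch. It is not the authors' GeneRel implementation (which, per Section 1, is realized as semantic actions of the meta-grammar); the `Production` class, the `gene_rel` function, and the `MRP` list are hypothetical names introduced only for this illustration, and delimiters are omitted because they do not produce attributes.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Production:
    """Hypothetical in-memory form of one TCFG production."""
    tag: str                        # production tag-name r
    lhs: str                        # left-side nonterminal N
    rhs: List[Tuple[str, str]]      # tagged non-delimiter symbols (a_i, N_i)
    closure: Optional[str] = None   # None for constructors, "*" or "+" for lists


def gene_rel(productions):
    """Return one 'create' statement per production, following the rules above."""
    statements = []
    for p in productions:
        # occur ranges over the domain of the left-side nonterminal and is in the key.
        attrs = [f"occur:{p.lhs} not null"]
        # One attribute per non-delimiter symbol, named after its tag-name.
        attrs += [f"{tag}:{symbol}" for tag, symbol in p.rhs]
        # List productions get an extra key attribute recording element order.
        if p.closure is not None:
            attrs.append("position:counter not null")
        statements.append(f"create {p.tag}({', '.join(attrs)})")
    return statements


# The MRP grammar of Example 1.
MRP = [
    Production("pmr", "part", [("p#", "int"), ("descr", "str"), ("uom", "int"),
                               ("leadtime", "int"), ("partType", "type")]),
    Production("vendor", "type", [("vendorName", "str"), ("cost", "int")]),
    Production("bom", "type", [("subpartQty", "subpQty")], closure="+"),
    Production("subparts", "subpQty", [("subpart", "part"), ("qty", "int")]),
]

for statement in gene_rel(MRP):
    print(statement)
```

Each constructor production yields a relation keyed on occur alone, while the list production bom is keyed on the pair (occur, position).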

Example  2  The  relational  schema  definitions  generated  from  the  TCFG  in  Example  1  are: 


² We assume the relational model ([Cod79]) extended with domains of surrogates as described in [HOT76].
Surrogates are system generated internal identifiers that are ideal for representing unnamed objects.
Type          Production                                                Relation

Constructor   <r : N> → w_1 <a_1 : N_1> ... w_k <a_k : N_k> w_{k+1}     r(occur : N, a_1 : N_1, ..., a_k : N_k)

List          <r : N> → <a_1 : N_1>□                                    r(occur : N, a_1 : N_1, position : counter)

Figure 1: GeneRel


create pmr(occur:part not null, p#:int, descr:str,
           uom:int, leadtime:int, partType:type)
create vendor(occur:type not null, vendorName:str, cost:int)
create bom(occur:type not null, subpartQty:subpQty,
           position:counter not null)
create subparts(occur:subpQty not null, subpart:part, qty:int)


There are four productions in Example 1, so four relations are generated. There is a
domain of surrogates (part, type, subpQty) corresponding to each of the three nonterminals,
and the lexical domains int and str contain all values that are in the syntactic category
returned by the lexical analyzer for the corresponding lexicons. The relation bom has a
position attribute since it is generated from a list production. ■

3  Database  Population 

Context-free  grammars  provide  the  basis  for  many  extremely  useful  tools  such  as  parser- 
generators,  compiler-compilers  and  editing  environments.  For  instance,  we  have  designed 
and  implemented  a  tool,  GeneParse,  that  generates  parsers  that  populate  the  database. 
This  tool  is  appropriate  in  applications,  such  as  programming  languages,  in  which  objects 
are  defined  by  context-free  grammars  and  are  written  as  sentences  of  the  grammar. 

GeneParse generates one parser, parser+, for each TCFG. Sentences accepted by the
TCFG can then be parsed and stored under the TCFG's corresponding relational schema.
Each production in the TCFG is translated into an equivalent production in YACC [Joh78].
The translation for constructor productions is straightforward: tags are removed and the
proper delimiters are used. Each list production generates two YACC productions: one is
left recursive and the other is either a single symbol (for +) or the empty string (for *). In
addition, GeneParse generates semantic actions that insert the sentences into the database
when the corresponding productions are recognized.
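As an illustration of the list-production translation only, the following Python sketch emits the two YACC rules described above. The function name `list_to_yacc` and the exact rule layout are assumptions made for this example; the semantic actions that GeneParse attaches to each rule are omitted.

```python
def list_to_yacc(lhs, symbol, closure):
    """Return the two YACC rules for the list production  lhs -> symbol<closure>.

    The base rule is a single symbol for '+' and the empty string for '*';
    the second rule is left recursive in both cases.
    """
    if closure == "+":
        base = f"{lhs} : {symbol} ;"
    elif closure == "*":
        base = f"{lhs} : /* empty */ ;"
    else:
        raise ValueError("closure must be '*' or '+'")
    recursive = f"{lhs} : {lhs} {symbol} ;"
    return [base, recursive]


# A list of statements (Kleene closure) and a bill of material (positive closure).
for rule in list_to_yacc("stmtlist", "stmt", "*") + list_to_yacc("type", "subpQty", "+"):
    print(rule)
```

Left recursion is the conventional choice for repetition in LALR parser generators such as YACC, since it keeps the parser stack shallow while list elements are recognized in order.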
begin
  X := 3
  if X == 4 then
    X := 5
  else
    if X == 3 then
      X := 4
    else
      X := 3
    endif
  endif
end

[Relation tables of the figure, including stmts, are not reproduced.]

Figure 2: GeneParse: Stores Data into Relations


Example 3 Software engineers are interested in providing database support for programming
languages, whose structures are often formally specified by grammars. The following
TCFG defines a simplified structured programming language.


<prog:block>      → begin <body:stmtlist> end
<stmts:stmtlist>  → <stmt:stmt>*
<ifstmt:stmt>     → if <bool:cond>
                    then <trueact:stmtlist>
                    else <falseact:stmtlist> endif
<assign:stmt>     → <var:id> := <value:int>
<equal:cond>      → <var:id> == <value:int>


Figure 2 shows a program that is in the language of this TCFG and how it is stored
under the corresponding relations by the parser+ generated by GeneParse. Note that for
this example we show surrogates that would not normally be exposed to the user. ■


4  The  Grammar  Catalog 

The TCFGs and their sentences correspond to the schema and data levels in the
intension-extension framework for DBMSs presented in [Mar85] and [MR87].³ In this framework, a
relational schema is the intension of the (extensional) data. TCFGs and their sentences are
coupled to this framework using GeneRel and GeneParse (Figure 3).

³ [O'C90] depicts the tasks that are associated with the usage of context-free grammars in a similar
framework.

Figure 3: Intension-Extension Framework

The  middle  level  of  this  framework  is  the  level  of  database  definition  where  the  user 
specifies  TCFGs.  GeneRel  generates  the  corresponding  relational  schemes,  and  GeneParse 
generates  the  corresponding  parser+  which  is  employed  at  the  lowest  level  to  store  the 
sentences  in  the  database.  Additionally,  the  specified  TCFGs  are  parsed  and  stored  in 
a  grammar  catalog.  This  supports,  as  is  often  necessary,  access  to  structural  information 
about  the  data  stored  in  the  database. 

In our implementation, we support the middle level of the framework having only
implemented the top level. Support for the grammar catalog was generated by applying GeneRel
and GeneParse to a TCFG (the meta-grammar) that describes the class of TCFGs. GeneRel
was applied to the meta-grammar to produce a set of relation schemes under which any
TCFG, including the meta-grammar itself, can be stored. GeneParse was applied to the
meta-grammar to produce the parser+ in the middle level for storing the TCFGs in the
database.


5  Queries 

The following examples demonstrate that several queries involving complex objects that are
described by TCFGs can be expressed in current relational languages.⁴

Example 4 What is the level 1 explosion (i.e., the immediate sub-parts) of the
manufactured part p1?

select sub.p#
from super as pmr, bom, subparts, sub as pmr
where super.p#=p1
and super.partType=bom.occur
and bom.subpartQty=subparts.occur
and subparts.subpart=sub.occur; ■
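To make the join path of Example 4 concrete, here is a small runnable sketch using Python's built-in sqlite3 module rather than the relational system assumed in the paper. The sample rows, the surrogate strings (part1, type1, sq1, ...), and the helper names `demo_db` and `level1_explosion` are invented for illustration; the column p# is renamed pno to avoid quoting, only the columns needed by the query are created, and aliases use the standard SQL order (pmr AS super).

```python
import sqlite3


def demo_db():
    """Build a tiny in-memory instance of the pmr, bom, and subparts relations."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE pmr(occur TEXT PRIMARY KEY, pno TEXT, partType TEXT);
        CREATE TABLE bom(occur TEXT, subpartQty TEXT, position INTEGER,
                         PRIMARY KEY (occur, position));
        CREATE TABLE subparts(occur TEXT PRIMARY KEY, subpart TEXT, qty INTEGER);
        -- p1 is manufactured from 2 x p2 and 4 x p3 (made-up data)
        INSERT INTO pmr VALUES ('part1', 'p1', 'type1'),
                               ('part2', 'p2', 'type2'),
                               ('part3', 'p3', 'type3');
        INSERT INTO bom VALUES ('type1', 'sq1', 1), ('type1', 'sq2', 2);
        INSERT INTO subparts VALUES ('sq1', 'part2', 2), ('sq2', 'part3', 4);
    """)
    return conn


def level1_explosion(conn, part_no):
    """Part numbers of the immediate sub-parts of the part numbered part_no."""
    rows = conn.execute(
        """
        SELECT sub.pno
        FROM pmr AS super, bom, subparts, pmr AS sub
        WHERE super.pno = ?
          AND super.partType = bom.occur
          AND bom.subpartQty = subparts.occur
          AND subparts.subpart = sub.occur
        ORDER BY sub.pno
        """,
        (part_no,),
    ).fetchall()
    return [row[0] for row in rows]


print(level1_explosion(demo_db(), "p1"))
```

The query follows surrogates across three relations: from the part master record to its bill of material, from each list element to its subpart/quantity pair, and from there back to the subpart's master record.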

Example 5 Which variables have a value of 4 assigned to them?

select var
from assign
where value = 4; ■

Example 6 What are all the occurrences of statements?⁵

(select occur from ifstmt)
union
(select occur from assign); ■

However,  queries  on  complex  objects  frequently  involve  some  form  of  recursion  and, 
therefore,  cannot  be  described  by  standard  relational  queries.  Several  research  efforts  extend 
the  expressiveness  of  relational  languages  with  limited  forms  of  recursion  such  as  transitive 
closure  [Agr88],  linear  recursion  [JAN87],  path  algebra  [DS86,Car78],  and  traversal  recursion 
[RHDM86].  A  survey  of  extensions  to  query  languages  to  support  graph  traversal  appears  in 
[MS90].  However,  none  of  these  extensions  allow  the  user  to  express  recursion  that  involves 
multiple  relations.  This  type  of  recursion  is  important  for  complex  objects  since  the  data 
of  a  given  object  is  often  distributed  over  several  relations. 

⁴ The examples in this section are based on the example grammars in Section 2 and Section 3.
⁵ Although surrogates cannot be printed, they may be the intermediate results of queries.


The operators introduced in this section help the user express queries about sentences
that correspond to TCFGs. They can be combined with relational expressions to form
queries that associate information from multiple sentences even when these sentences are
derived from different TCFGs. The operators can be expressed as a system of simultaneous
relational queries (e.g., a set of queries whose fixed point contains the desired result) that
are generated from the TCFG at the same time the relations are defined.

5.1  Graph  Operators  -  Semantics 

The semantics of the graph operators presented in this section is described in terms of a
conceptual directed acyclic graph (DAG) that corresponds to the set of complex objects
stored in the relations.⁶ The operator derives, based on the notion of ⇒ from language
theory [HU79], reconstructs the lexical information (as defined by the TCFG) for surrogates
that represent the nodes in the DAG. The operators reach and contain facilitate the
extrapolation of parent-sibling relationships from graphs that span several relations. They
are based on the concepts of reach and inverse reach from graph theory: suppose G is a
directed graph and there is a path from node j to node k in G; then k is in the reach of j
and j is in the inverse reach of k.

The  graph  operators  are  defined  as  follows: 

Let S be a unary relation with one attribute of surrogate values denoting nodes; let N
be a set of production tag-names representing node types.

derives(S) is a binary relation with attributes node and sentence representing the
mapping between nodes in S and their derived sentences.

reach(N,  S)  is  a  unary  relation  with  attribute  occur  that  consists  of  the  surrogates  for  all 
nodes  that  are  in  the  reach  of  at  least  one  of  the  nodes  represented  by  the  surrogates 
in  S  and  have  one  of  the  node  types  in  N.  If  N  is  not  specified,  then  all  node  types  are 
considered. 

contain(N,  S)  is  a  unary  relation  with  attribute  occur  consisting  of  the  surrogates  for  all 
nodes  that  are  in  the  inverse  reach  of  at  least  one  of  the  nodes  represented  by  the 
surrogates  in  S  and  have  one  of  the  node  types  in  N.  If  N  is  not  specified,  then  all  node 
types  are  considered. 
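The semantics of reach and contain can be sketched over an explicit edge list. This is an in-memory illustration only, not the relational computation given in Section 5.2: the edge representation, the `node_type` mapping, and the sample surrogates are assumptions. The sketch includes the input nodes themselves in the result, which matches the generated queries of Section 5.2, where I.occur seeds every query.

```python
from collections import deque


def _closure(adjacency, start):
    """All nodes reachable from `start` (including `start`) by breadth-first search."""
    seen = set(start)
    queue = deque(start)
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def reach(edges, node_type, start, types=None):
    """Surrogates in the reach of `start` whose node type is in `types` (all if None)."""
    adjacency = {}
    for parent, child in edges:
        adjacency.setdefault(parent, []).append(child)
    found = _closure(adjacency, start)
    return {n for n in found if types is None or node_type.get(n) in types}


def contain(edges, node_type, start, types=None):
    """Surrogates in the inverse reach of `start`: reach over the reversed edges."""
    reversed_edges = [(child, parent) for parent, child in edges]
    return reach(reversed_edges, node_type, start, types)


# Hypothetical surrogates for:  if X == 4 then X := 5 else ... endif
EDGES = [("s1", "c1"), ("s1", "l1"), ("l1", "s2")]
NODE_TYPE = {"s1": "ifstmt", "c1": "equal", "l1": "stmts", "s2": "assign"}

print(reach(EDGES, NODE_TYPE, {"s1"}, {"assign"}))
print(contain(EDGES, NODE_TYPE, {"s2"}, {"ifstmt"}))
```

Defining contain as reach on the reversed graph mirrors the remark in the text that the two operators rest on reach and inverse reach.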

The queries in this paper are written in the Starburst query language [HCL+90] which,
among other features, enhances SQL with table expressions [Dat84] and table functions.
Table functions are used to express the graph operators, and table expressions are used
to bind subqueries as input to the operators. A table expression, var as query, binds
the variable var to the subquery query. A table function, var as tf(p_1, ..., p_n), binds
the variable var to the table produced by the function tf which takes zero or more input
parameters. Any of these parameters can be tables. Table expressions and table functions
are listed in the from clause of a query, and the variables are treated as table names of the
referenced tables for the duration of the query. An example of an implementation that uses
table functions to extend the query language is described in [WCL91].

⁶ A given object corresponds to a parse tree, but if sharing of sub-objects is supported, the set of objects
corresponds to a DAG.

Example 7 Find the part numbers of all purchased parts that are needed in the
manufactured part p1.

select r.occur
from i as
       (select occur
        from pmr
        where p#=p1),
     r as reach({vendor}, i); ■

Example 8 Find the part numbers of all the parts that will be delayed if supplies of the
purchased part p7 are delayed.

select pmr.p#
from pmr
where pmr.occur in
  (select c.occur
   from i as
          (select pmr.occur
           from pmr
           where pmr.p#=p7),
        c as contain({pmr}, i)); ■

Example 9 All the sentences derived from the nonterminal stmt are obtained with the
following query:

select d.sentence
from i as ((select occur from ifstmt)
           union
           (select occur from assign)),
     d as derives(i); ■

Example 10 The surrogates of all statements that contain an assignment of the value 3
to a variable can be expressed using the contain operator.

select c.occur
from i as (select occur from assign where value = 3),
     c as contain({stmt}, i); ■

Example 11 The results of contain and reach operations can be combined with other
operators to further enhance the query. Let the result of the above query be the view P.
All conditions of if statements that contain the assignment of the value 3 to a variable can
be retrieved with the following query (the condition of an if statement is stored in the
attribute bool, whose domain is cond):

select ifstmt.bool
from ifstmt, P
where ifstmt.occur = P.occur; ■

Example 12 The surrogates of all assignment statements that are embedded in if-statements
that have a condition on the variable X can be retrieved with the following query:

select r.occur
from i as (select ifstmt.occur
           from ifstmt, equal
           where ifstmt.bool = equal.occur
           and equal.var = "X"),
     r as reach({assign}, i); ■

5.2  Graph  Operators  -  Computation 

A computation for each operator is derived from the TCFGs. During schema definition, a
set of SQL queries is constructed from the TCFG for each operator. An expression involving
a graph operator is evaluated by computing the fixed point of the set of queries (for that
operator) applied to the given input parameter. The construction of the set of queries for
computing reach and contain is similar, so only the computation of reach is given here.
The reach set of queries has one query for each production. The query K_r generated
for the production with tag-name K has the following form:

select * from K
where occur in
  (select I.occur from I
   union
   select {rhstag_1} from {prodtag_1}_r
   union
   ...
   union
   select {rhstag_n} from {prodtag_n}_r)

where I is the relation to which the operation is applied and, for each right-side
occurrence i of the left-side nonterminal of production K, prodtag_i is the tag-name of the production
containing this occurrence and rhstag_i is the non-delimiter tag-name of i. To evaluate
reach({K_1, ..., K_n}, E), where each K_i is a production tag-name from the TCFG, the fixed point of
this set of queries is computed with E substituted for I, and the relation
{K_1}_r union {K_2}_r union ... union {K_n}_r is returned.

Example 13 Suppose we have the following grammar:

<A1:A> → <b1:B> <c1:C>
<A2:A> → <d1:D>*
<E1:E> → <b2:B> <a1:A>
<F1:F> → <c2:C> <a2:A>

The following queries are generated for the reach operator:

A1_r = select * from A1
       where occur in
         (select I.occur from I
          union
          select a1 from E1_r
          union
          select a2 from F1_r)

A2_r = select * from A2
       where occur in
         (select I.occur from I
          union
          select a1 from E1_r
          union
          select a2 from F1_r)

E1_r = select * from E1 where occur in (select I.occur from I)

F1_r = select * from F1 where occur in (select I.occur from I)

To evaluate reach({A1, A2}, E), compute the fixed point of the reach system of queries
with E substituted for I, and return A1_r union A2_r. ■
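The fixed-point evaluation of such a system can be sketched in Python over in-memory relations. This is a naive iteration written for illustration, under the assumption that relations are lists of dictionaries keyed by attribute name; the function `reach_fixed_point`, the surrogate strings, and the sample rows are hypothetical, and a real system would evaluate the generated SQL queries instead.

```python
def reach_fixed_point(productions, relations, seeds):
    """Compute K_r for every production K by iterating to a fixed point.

    productions: {tag: (lhs, [(rhs_tag, rhs_symbol), ...])}
    relations:   {tag: [row, ...]}, each row a dict with 'occur' plus one key per rhs tag
    seeds:       set of surrogates playing the role of the input relation I
    """
    # For production K with left side N: every (prodtag, rhstag) where N occurs on a right side.
    feeders = {
        k: [(p, tag) for p, (_, rhs) in productions.items()
            for tag, symbol in rhs if symbol == lhs]
        for k, (lhs, _) in productions.items()
    }
    result = {k: [] for k in productions}
    changed = True
    while changed:
        changed = False
        for k in productions:
            allowed = set(seeds)                      # select I.occur from I
            for p, tag in feeders[k]:                 # union select rhstag from prodtag_r
                allowed |= {row[tag] for row in result[p]}
            rows = [row for row in relations[k] if row["occur"] in allowed]
            if len(rows) != len(result[k]):           # results only grow, so size suffices
                result[k] = rows
                changed = True
    return result


# Grammar of Example 13 with made-up rows; B, C, and D are treated as lexicons.
GRAMMAR = {
    "A1": ("A", [("b1", "B"), ("c1", "C")]),
    "A2": ("A", [("d1", "D")]),
    "E1": ("E", [("b2", "B"), ("a1", "A")]),
    "F1": ("F", [("c2", "C"), ("a2", "A")]),
}
RELATIONS = {
    "A1": [{"occur": "sA1", "b1": "u", "c1": "v"}, {"occur": "sA9", "b1": "w", "c1": "x"}],
    "A2": [{"occur": "sA2", "d1": "y", "position": 1}],
    "E1": [{"occur": "sE1", "b2": "u", "a1": "sA1"}],
    "F1": [{"occur": "sF1", "c2": "v", "a2": "sA2"}],
}

r = reach_fixed_point(GRAMMAR, RELATIONS, {"sE1"})
print([row["occur"] for row in r["A1"]], [row["occur"] for row in r["A2"]])
```

Because each query is monotone in the results of the others, repeating the evaluation until no relation grows is enough to reach the least fixed point; more efficient strategies (for example, semi-naive evaluation) are outside the scope of this sketch.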

The derives set of queries is not as natural to the relational model as that of the other
operators because of its dependence on the position information in the list rules. It uses the
aggregate operator max and the string concatenation operator (||). The reach operator is
used to restrict the computation of derivations to only those objects that are components
of objects specified in the input relation, I.

The set of queries has one nonterminal query for each nonterminal and one list query
for each list production. A nonterminal query for nonterminal N generates the temporary
relation N_str(occur, str), where str is the derivation of the object represented by occur.
N_str is a union of subqueries, one for each production for N. The form of the subqueries is
dependent on the type of the production. A list query for a list production with production
tag-name r generates the temporary relation r_Lstr(occur, position, str) where str is the
concatenation of the derivations for all surrogates including and following the position-th
element in the list.

Recall the production forms from Section 2. Let

• {<v_1 : V_1>, ..., <v_n : V_n>} for n ≤ k represent all the <a_x : N_x> that are
nonterminal symbols.

• {<l_1 : L_1>, ..., <l_m : L_m>} for m ≤ k represent all the <a_x : N_x> that are lexicon
symbols.

• {a_x}_str represent the string value for <a_x : N_x>, where

  - {a_x}_str = {N_x}_str.str if N_x is a nonterminal, and

  - {a_x}_str = r.a_x if N_x is a lexicon.

• reachOfI = select r.occur from i as I, r as reach(i).

Subqueries for constructor rules where n > 0 have the form:

select r.occur, w_1 || {a_1}_str || ... || w_k || {a_k}_str || w_{k+1}
from r, r_1 as {V_1}_str, ..., r_n as {V_n}_str
where r.v_1 = r_1.occur and ... and r.v_n = r_n.occur;

Subqueries for constructor rules where n = 0 (no nonterminals on the right side, hence no
{V_i}_str relations to join) have the form:

select r.occur, w_1 || {a_1}_str || ... || w_k || {a_k}_str || w_{k+1}
from r, reachOfI
where r.occur = reachOfI.occur;

The subquery that is generated as part of the nonterminal query for list productions has the form:

select occur, str
from r_Lstr
where position = 1;

The list query r_Lstr for a list production where the repeating symbol is a nonterminal
has the form:

r_Lstr = select r_Lstr.occur, r_Lstr.position, str = {N_1}_str.str || r_Lstr.str
         from r_Lstr, r
         where r.position = r_Lstr.position - 1 and r.a_1 = {N_1}_str.occur
         union
         select r.occur, max(r.position), {N_1}_str.str
         from r, {N_1}_str
         where r.a_1 = {N_1}_str.occur
         group by r.occur;

The list query r_Lstr for a list production where the repeating symbol is a lexicon has
the form:

r_Lstr = select r_Lstr.occur, r_Lstr.position, str = r.a_1 || r_Lstr.str
         from r_Lstr, r
         where r.position = r_Lstr.position - 1
         union
         select r.occur, max(r.position), r.a_1
         from r, reachOfI
         where r.occur = reachOfI.occur
         group by r.occur;
Example 14 The following is the derives set of queries for the language example given
in Section 3.

block_str = select prog.occur, "begin" || r1.str || "end"
            from prog, r1 as stmtlist_str
            where prog.body = r1.occur;

stmt_str = select ifstmt.occur, "if" || r1.str ||
                  "then" || r2.str ||
                  "else" || r3.str || "endif"
           from ifstmt, r1 as cond_str, r2 as stmtlist_str, r3 as stmtlist_str
           where ifstmt.bool = r1.occur and ifstmt.trueact = r2.occur
           and ifstmt.falseact = r3.occur
           union
           select assign.occur, assign.var || ":=" || assign.value
           from assign, reachOfI
           where assign.occur = reachOfI.occur;

stmtlist_str = select occur, str
               from stmts_Lstr
               where position = 1;

stmts_Lstr = select stmts_Lstr.occur, stmts_Lstr.position,
                    str = stmts.stmt || stmts_Lstr.str
             from stmts_Lstr, stmts
             where stmts.position = stmts_Lstr.position - 1
             union
             select stmts.occur, max(stmts.position), stmts.stmt
             from stmts
             group by stmts.occur;

cond_str = select equal.occur, equal.var || "==" || equal.value
           from equal, reachOfI
           where equal.occur = reachOfI.occur;
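What derives is meant to return for the grammar of Example 3 can be illustrated with a recursive in-memory sketch in Python. This is deliberately not the fixed-point SQL computation above: it walks the stored tuples top-down, orders list elements by position, and joins tokens with single spaces. The `GRAMMAR` encoding, the surrogate strings, and the sample rows are assumptions made for this example, and the function handles one surrogate at a time rather than returning a (node, sentence) relation.

```python
# tag -> (left-side nonterminal, closure, right side); plain strings are delimiters,
# pairs are tagged non-delimiter symbols (tag-name, symbol).
GRAMMAR = {
    "prog":   ("block", None, ["begin", ("body", "stmtlist"), "end"]),
    "stmts":  ("stmtlist", "*", [("stmt", "stmt")]),
    "ifstmt": ("stmt", None, ["if", ("bool", "cond"), "then", ("trueact", "stmtlist"),
                              "else", ("falseact", "stmtlist"), "endif"]),
    "assign": ("stmt", None, [("var", "id"), ":=", ("value", "int")]),
    "equal":  ("cond", None, [("var", "id"), "==", ("value", "int")]),
}
LEXICONS = {"id", "int"}


def _value(value, symbol, relations):
    """String for one attribute value: literal for lexicons, recursive for nonterminals."""
    return str(value) if symbol in LEXICONS else derives(value, relations)


def derives(surrogate, relations):
    """Return the sentence derived from `surrogate` as a space-separated string."""
    for tag, (_, closure, rhs) in GRAMMAR.items():
        rows = [row for row in relations.get(tag, []) if row["occur"] == surrogate]
        if not rows:
            continue
        if closure is not None:
            # List production: concatenate the elements in position order.
            attr, symbol = rhs[0]
            rows.sort(key=lambda row: row["position"])
            parts = [_value(row[attr], symbol, relations) for row in rows]
        else:
            # Constructor production: interleave delimiters and attribute strings.
            row = rows[0]
            parts = [item if isinstance(item, str)
                     else _value(row[item[0]], item[1], relations) for item in rhs]
        return " ".join(part for part in parts if part)
    return ""  # no tuples found, e.g. an empty list under Kleene closure


# begin X := 3 if X == 4 then X := 5 else X := 3 endif end  (made-up surrogates)
RELATIONS = {
    "prog":   [{"occur": "b1", "body": "l1"}],
    "stmts":  [{"occur": "l1", "stmt": "s1", "position": 1},
               {"occur": "l1", "stmt": "s2", "position": 2},
               {"occur": "l2", "stmt": "s3", "position": 1},
               {"occur": "l3", "stmt": "s4", "position": 1}],
    "assign": [{"occur": "s1", "var": "X", "value": 3},
               {"occur": "s3", "var": "X", "value": 5},
               {"occur": "s4", "var": "X", "value": 3}],
    "ifstmt": [{"occur": "s2", "bool": "c1", "trueact": "l2", "falseact": "l3"}],
    "equal":  [{"occur": "c1", "var": "X", "value": 4}],
}

print(derives("b1", RELATIONS))
```

The recursive formulation makes the role of the position attribute visible: without it, the elements of a stmtlist could not be concatenated in their original order.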


6  Varying  the  Level  of  Decomposition 

As  previously  mentioned,  other  projects  that  provide  database  support  for  complex  objects 
store  only  aspects  about  the  objects  that  are  of  interest  to  the  current  application.  This 
poses  a  problem  for  query  evolution.  As  the  application  evolves  and  new  queries  arise, 
additional  information  about  the  objects  must  be  gathered  and  stored  in  the  database. 
The premise for our work is that all the information about complex objects, in particular
textual objects, is stored in the database. If this information is stored fully decomposed
and only accessed as a whole, there is clearly a lot of unnecessary overhead required to
reconstruct the information (i.e., the computation of derives from Section 5).

We  suggest  that  the  level  of  decomposition  for  any  complex  object  should  reflect  the 
level  of  access  needed  to  support  the  applications  that  are  querying  this  information.  The 
combination  of  GeneRel,  GeneParse,  and  derives  provides  this  flexibility.  Information 
for  which  component  fields  are  not  being  accessed  can  be  composed.  If,  in  the  future,  it 
is  necessary  to  access  the  components  of  these  fields,  the  information  needed  to  parse  the 
composed  fields  is  contained  in  the  stored  TCFG.  Furthermore,  if  the  component  fields  are 
being  accessed  frequently,  relations  for  storing  these  fields  can  be  generated  and  the  data 
in  these  fields  can  be  decomposed  into  the  new  relations. 

The level of decomposition is specified by a set of composite nonterminals. The
decomposition of a composite nonterminal, N, can be automated. The procedure that
decomposes N must perform the following:

1. apply derives to the meta-grammar to regenerate the TCFG from the Catalog,

2. let G_N represent a TCFG with start symbol N that contains all productions reachable
from N in the original TCFG:

• apply GeneRel to G_N to define the new relations needed to support the
decomposition of N

• apply GeneParse to G_N to generate a sub-parser+ for parsing and storing the
composed lexical fields of N,

3. for every production that has N on the right side, generate a new relation with the
same structure as the old relation, replacing the lexical domain N with a domain
of surrogates,

4. update the parser+ with insertion statements for storing future sentences, and

5. define views to support queries that previously accessed composed fields for N.

Clearly, there is an inverse procedure for building composite nonterminals from decomposed
information.
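
Steps 2 and 3 of the procedure rest on two simple grammar computations: collecting the productions reachable from N, and finding the productions that mention N on the right side. The following is a minimal sketch of both, assuming a grammar is held as a dictionary from each nonterminal to its list of right-hand sides. This representation and the function names are illustrative assumptions and do not reflect how GeneRel or GeneParse store the TCFG.

```python
def reachable_subgrammar(productions, start):
    """Return the productions reachable from start (the sub-grammar G_N).

    productions: dict mapping a nonterminal to a list of right-hand sides,
    each right-hand side being a list of symbols. Symbols that are not keys
    of the dict are treated as lexical domains (terminals).
    """
    reachable = set()
    worklist = [start]
    while worklist:
        nt = worklist.pop()
        if nt in reachable or nt not in productions:
            continue
        reachable.add(nt)
        for rhs in productions[nt]:
            for sym in rhs:
                if sym in productions and sym not in reachable:
                    worklist.append(sym)
    return {nt: rhs_list for nt, rhs_list in productions.items() if nt in reachable}


def productions_using(productions, n):
    """Return (lhs, rhs) pairs whose right-hand side mentions n (step 3)."""
    return [(lhs, rhs)
            for lhs, rhs_list in productions.items()
            for rhs in rhs_list
            if n in rhs]
```

Under these assumptions, the relations of the productions returned by productions_using are the ones that must be regenerated with a surrogate domain in place of the lexical domain N.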

Example  15  Imagine  an  application  that  maintains  a  database  of  names  and  addresses, 
described  by  the  TCFG  that  follows. 

<pinfo:info> → <pname:name> <paddress:address>

<usa:address> → <street:street> <state:state> <zip:zipCode>
In the early stages of the application, the database was only required to print the
addresses, so address was added to the list of composite nonterminals, and the information
above was stored in the relation:

pinfo(occur:info, pname:name, paddress:address)

where  info  is  a  set  of  surrogates,  name  is  a  lexical  domain  for  storing  names,  and  address 
is  a  domain  that  is  defined  by  the  above  production  for  address. 

It  then  became  necessary  to  form  queries  that  require  access  to  the  state  and  zipcode 
fields  of  the  addresses.  The  new  relations  generated  to  handle  this  scenario  were: 

pinfo'(occur:info, pname:name, paddress:address');
usa(usa:address', street:street, state:state, zip:zipCode);

where address' is now a set of surrogates, and street, state, and zipCode are lexical
domains.

The  following  steps  must  be  taken  to  decompose  the  data  in  the  database  and  to  provide 
support  for  queries  that  accessed  the  old  relations. 

for each tuple P in pinfo
{
    parse P.address and store in usa;
    let S be the surrogate of the stored tuple;
    insert into pinfo'
        values (occur = P.occur, name = P.name, address = S);
}

delete relation pinfo;
create view pinfo =
    select occur, name, address = d.sentence
    from pinfo', a as (select occur from usa), d as derives(a)
    where pinfo'.occur = a.occur;

Future applications can now form queries that access the states and zipcodes of the
addresses. Meanwhile, existing applications that access pinfo are not invalidated; they
are supported by the view pinfo. ■
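
The migration in Example 15 can be sketched against an ordinary SQL engine. The code below is only an illustration under several assumptions: SQLite stands in for the database, pinfo_new plays the role of pinfo', SQLite row ids serve as surrogates, and a string concatenation in the view replaces the derives operator. The function parse_address is a hypothetical stand-in for the sub-parser that GeneParse would generate, and it assumes addresses formatted as "street, state zip".

```python
import sqlite3


def parse_address(text):
    # Hypothetical stand-in for the generated sub-parser; assumes the
    # composed field looks like "street, state zip".
    street, rest = text.split(",", 1)
    state, zip_code = rest.split()
    return street.strip(), state, zip_code


def decompose_addresses(con):
    """Decompose the composed paddress field of pinfo into a usa relation."""
    con.execute("create table usa (usa integer primary key, "
                "street text, state text, zip text)")
    con.execute("create table pinfo_new (occur integer, pname text, "
                "paddress integer)")
    old_rows = con.execute("select occur, pname, paddress from pinfo").fetchall()
    for occur, pname, paddress in old_rows:
        # parse the composed field and store its components in usa
        cur = con.execute("insert into usa (street, state, zip) values (?, ?, ?)",
                          parse_address(paddress))
        # the new row id acts as the surrogate S of the stored tuple
        con.execute("insert into pinfo_new values (?, ?, ?)",
                    (occur, pname, cur.lastrowid))
    con.execute("drop table pinfo")
    # the view recomposes the address so that existing queries keep working
    con.execute("""
        create view pinfo as
        select p.occur, p.pname,
               u.street || ', ' || u.state || ' ' || u.zip as paddress
        from pinfo_new p join usa u on p.paddress = u.usa
    """)
```

After the call, queries written against pinfo read from the view, while new queries can select on usa.state or usa.zip directly.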


7  Future  Work 

Several  other  projects  have  developed  grammatical  models  for  describing  complex  objects 
[GT83,GT87,GPV89,CCRZ+90], built tool-generators [MKN89], employed grammar
specifications to support database operations [Loh87,RB82], incorporated relational concepts
to enhance language specifications [HT86,Hor90,CCRML88], and used relational databases
to  store  complex  objects  [BK85,CNR90,Lin84,LKM+85,RUTP85,RY85,Row89].  To  our 
knowledge,  no  one  else  has  attempted  to  map  grammatical  descriptions  of  objects  into  the 
relational  model  and  to  use  the  structural  information  contained  in  these  descriptions  to 
facilitate  the  manipulation  of  the  stored  data. 

There  are  still  several  issues  that  must  be  addressed  to  understand  the  full  potential 
and  practicality  of  our  approach. 

• GeneRel is one possible mapping from TCFGs to relational schema. Is there a better
representation for the generated relations, one that collapses recursive queries
spanning several relations into transitive closure queries within a single relation? This
would allow us to take advantage of the existing formalisms for expressing
transitive closure [Agr88,JAN87,Car78,RHDM86,KB88] and the known techniques for the
efficient management of transitive closure [ABJ89,VB86].

•  GeneParse  is  a  tool  that  facilitates  the  population  of  the  database  for  textual  objects. 
Other  tools  must  be  developed  that  support  the  update  of  objects  in  the  database  and 
the  specification  of  shared  subobjects  (a  requirement  for  any  system  that  supports 
complex  objects  [BKKG88]).  The  Exodus  data  model  [CDV88]  provides  support 
for  distinguishing  between  shared  and  non-shared  fields  that  can  be  adapted  to  our 
environment.  Can  shared  subobjects  be  specified  through  a  data  editor  generated  by 
an  editor-generator? 

•  The  three  graph  operators  demonstrate  that  the  database  can  utilize  the  structural 
information  from  the  grammatical  descriptions.  There  must  be  additional  facilities 
for  expressing  queries  that  relate  information  within  a  complex  object.  We  are  very 
interested  in  investigating  the  utilization  of  a  modified  attribute  grammar  formalism 
[Hor90]  for  expressing  queries  that  relate  information  within  a  complex  object  and 
combining  these  results  with  relational  languages  to  relate  information  between  several 
complex  objects. 

•  GeneRel  and  GeneParse  facilitate  varying  levels  of  decomposition.  How  should  the 
level  of  decomposition  be  specified?  Can  it  be  adjusted  during  database  operation  by 
analyzing query usage patterns [Rou82]?

•  Grammars  are  the  basis  for  several  software  tools  such  as  syntax-directed  editors 
[RT85]  and  data  translation  [MKN89,  RC89].  Can  these  tools  be  employed  in  a 
database environment? For example, how useful are syntax-directed editors for
editing database objects? In the scenario described in Section 6, can they be used to
ensure that the data conforms to the relational schema and to enforce the structure
of the domains of composed nonterminals?
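
To make the first question above concrete, the following sketch shows what a transitive closure query within a single relation looks like. It assumes, purely for illustration, that all parent-child links between surrogates are kept in one relation edge(parent, child); this is not the schema that GeneRel generates. SQLite's recursive common table expression is used as one existing way of expressing the closure, and the function name descendants is hypothetical.

```python
import sqlite3


def descendants(edges, root):
    """Return, in sorted order, all nodes reachable from root.

    edges: iterable of (parent, child) pairs stored in one relation.
    """
    con = sqlite3.connect(":memory:")
    con.execute("create table edge (parent integer, child integer)")
    con.executemany("insert into edge values (?, ?)", edges)
    rows = con.execute("""
        with recursive reach(node) as (
            select child from edge where parent = ?
            union
            -- union (not union all) removes duplicates, so cycles terminate
            select e.child from reach r join edge e on e.parent = r.node
        )
        select node from reach order by node
    """, (root,)).fetchall()
    con.close()
    return [row[0] for row in rows]
```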

This  paper  has  described  one  realization  of  a  grammatical  schema  definition  language  in 
the  relational  model  that  has  been  implemented  using  meta-description,  parser  generators, 
and  attribute  grammars.  The  generation  of  the  graph-operator  computations  demonstrated 
that  the  grammar  descriptions  contain  more  information  about  the  objects  stored  in  the 
database  than  previous  flat  relational  descriptions.  Furthermore,  efficiency  issues  concerning 
fully  decomposed  objects  can  be  alleviated  using  GeneParse  and  GeneRel  to  vary  the  level 
of  decomposition. 